Time Series in R

Author

Arvind Venkatadri

Published

January 12, 2023

Introduction

Any metric that is measured over regular time intervals forms a time series. Analysis of time series is commercially important because of industrial need and relevance, especially with respect to forecasting (weather data, sports scores, population growth figures, stock prices, demand, sales, supply…). The graph below shows temperature over time in two US cities:

A time series can be broken down to its components so as to systematically understand, analyze, model and forecast it. We have to begin by answering fundamental questions such as:

  1. What are the types of time series?
  2. How do we decompose a time series into level, trend, and seasonal components?
  3. What is autocorrelation?
  4. What is a stationary time series?
  5. How do we plot time series?

Introduction to Time Series: Data Formats

There are multiple formats for time series data.

  • Tibble format: the simplest is of course the standard tibble/data frame, with a time variable indicating that the other variables vary with time.

  • The ts format: The stats::ts() function will convert a numeric vector into an R time series ts object.

  • The modern tsibble format: this is a new format for time series analysis, and is used by the tidyverts set of packages.

  • The base ts object is used by established packages such as forecast

  • The standard tibble object is used by timetk & modeltime

  • The special tsibble object (“time series tibble”) is used by fable, feasts and others from the tidyverts set of packages
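To make the distinction concrete, here is a tiny annual series created in both the base ts format and the tsibble format (a minimal sketch with made-up numbers; it assumes the tsibble package is installed):

```r
library(tsibble)

values <- c(10, 12, 15, 13, 18) # five made-up annual observations

# Base R ts object: the time information lives in attributes
values_ts <- ts(values, start = 2015, frequency = 1)

# tsibble: time is an explicit column, declared as the index
values_tsbl <- tsibble(year = 2015:2019,
                       value = values,
                       index = year)
```

In the ts object the dates are implicit (start year plus frequency), whereas the tsibble keeps them as a visible column you can filter and mutate with dplyr verbs.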

Creating and Plotting Time Series

We will first work with simple ts data, then with a tsibble example, and finally with a tibble that we can plot as is, and do more with after conversion to tsibble format.

ts format data

There are a few datasets in base R that are in ts format already.

AirPassengers
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
1952 171 180 193 181 183 218 230 242 209 191 172 194
1953 196 196 236 235 229 243 264 272 237 211 180 201
1954 204 188 235 227 234 264 302 293 259 229 203 229
1955 242 233 267 269 270 315 364 347 312 274 237 278
1956 284 277 317 313 318 374 413 405 355 306 271 306
1957 315 301 356 348 355 422 465 467 404 347 305 336
1958 340 318 362 348 363 435 491 505 404 359 310 337
1959 360 342 406 396 420 472 548 559 463 407 362 405
1960 417 391 419 461 472 535 622 606 508 461 390 432
str(AirPassengers)
 Time-Series [1:144] from 1949 to 1961: 112 118 132 129 121 135 148 148 136 119 ...

This can be easily plotted using base R:

plot(AirPassengers)

One can see an upward trend, and also seasonal variations that increase over time.
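These components can be separated with base R's stats::decompose(). Since the seasonal swings grow with the level of the series, a multiplicative decomposition is a reasonable first choice (a minimal sketch):

```r
# Classical decomposition into trend, seasonal and random components.
# type = "multiplicative" because the seasonal variation grows with the trend.
parts <- decompose(AirPassengers, type = "multiplicative")
plot(parts) # four stacked panels: observed, trend, seasonal, random
```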

Let us now take data that is “time-oriented” but not in ts format. The syntax of ts() is:

Syntax:  objectName <- ts(data, start, end, frequency)

where,

    `data`: represents the data vector
    `start`: represents the first observation in time series
    `end`: represents the last observation in time series
    `frequency`: represents number of observations per unit time. For 
    example 1=annual, 4=quarterly, 12=monthly, etc.
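For instance, a quarterly series (frequency = 4) starting in the first quarter of 2020 can be sketched with made-up figures:

```r
sales <- c(23, 31, 28, 40, 25, 34, 30, 45)               # made-up quarterly figures
sales_ts <- ts(sales, start = c(2020, 1), frequency = 4) # Q1 2020 onwards
sales_ts # prints as a quarterly table, ending in Q4 2021
```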

We will pick a simple numerical vector, the Height variable from the trees dataset:

trees
   Girth Height Volume
1    8.3     70   10.3
2    8.6     65   10.3
3    8.8     63   10.2
4   10.5     72   16.4
5   10.7     81   18.8
6   10.8     83   19.7
7   11.0     66   15.6
8   11.0     75   18.2
9   11.1     80   22.6
10  11.2     75   19.9
11  11.3     79   24.2
12  11.4     76   21.0
13  11.4     76   21.4
14  11.7     69   21.3
15  12.0     75   19.1
16  12.9     74   22.2
17  12.9     85   33.8
18  13.3     86   27.4
19  13.7     71   25.7
20  13.8     64   24.9
21  14.0     78   34.5
22  14.2     80   31.7
23  14.5     74   36.3
24  16.0     72   38.3
25  16.3     77   42.6
26  17.3     81   55.4
27  17.5     82   55.7
28  17.9     80   58.3
29  18.0     80   51.5
30  18.0     80   51.0
31  20.6     87   77.0
# Choosing the `Height` variable
trees_ts <- ts(trees$Height, 
               frequency = 1, # No reason to believe otherwise
               start = 1980)  # Arbitrarily picked "1980" !
plot(trees_ts)

(Note that this example is just for demonstration: tree heights do not decrease over time!)

tsibble data

The package tsibbledata contains several ready-made tsibble format datasets. Run data(package = "tsibbledata") to find out about these. Let us try PBS, a dataset of monthly Medicare prescription data in Australia.

data("PBS")

This is a large dataset: 336 combinations of key variables, each observed monthly. Note that there is more than one quantitative variable, which the ts format would not be able to support.

There are multiple Quantitative variables (Scripts and Cost). The Qualitative variables are described below. (Type help("PBS") in your Console.)

The data is disaggregated using four keys:

Concession: Concessional scripts are given to pensioners, unemployed, dependents, and other card holders
Type: Co-payments are made until an individual’s script expenditure hits a threshold ($290.00 for concession, $1141.80 otherwise). Safety net subsidies are provided to individuals exceeding this amount.
ATC1: Anatomical Therapeutic Chemical index (level 1)
ATC2: Anatomical Therapeutic Chemical index (level 2)

Let us simply plot Cost over time:

PBS %>% ggplot(aes(x = Month, y = Cost)) + 
  geom_point() + 
  geom_line()

This basic plot is quite messy. We ought to use dplyr to filter the data using some combination of the Qualitative variables (336 combinations!). Let us try that now:

PBS %>% count(ATC1, ATC2, Concession, Type)
# A tibble: 336 × 5
   ATC1  ATC2  Concession   Type            n
   <chr> <chr> <chr>        <chr>       <int>
 1 A     A01   Concessional Co-payments   204
 2 A     A01   Concessional Safety net    204
 3 A     A01   General      Co-payments   204
 4 A     A01   General      Safety net    204
 5 A     A02   Concessional Co-payments   204
 6 A     A02   Concessional Safety net    204
 7 A     A02   General      Co-payments   204
 8 A     A02   General      Safety net    204
 9 A     A03   Concessional Co-payments   204
10 A     A03   Concessional Safety net    204
# … with 326 more rows

We have 336 combinations of Qualitative variables, each containing 204 observations: so let us filter for a few such combinations and plot:

PBS %>% dplyr::filter(Concession == "General", 
                      ATC1 == "A",
                      ATC2 == "A10") %>% 
  ggplot(aes(x = Month, y = Cost, colour = Type)) + 
  geom_line() + 
  geom_point()

As can be seen, the two payment Types show very different time patterns. Both are strongly seasonal, with seasonal variation increasing over the years, but only the Co-payments method shows an upward trend.

tibble data

Let us read in and inspect the US births data from 2000 to 2014. Download this data and save the file in a sub-folder called data inside your project:

Read this data in:

Rows: 5,479
Columns: 5
$ year          <dbl> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20…
$ month         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ date_of_month <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
$ day_of_week   <dbl> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3,…
$ births        <dbl> 9083, 8006, 11363, 13032, 12558, 12466, 12516, 8934, 794…

So there are several numerical variables: year, month, date_of_month, day_of_week, and of course the daily births. We will combine the separate date components into a single date column, and convert the tibble into a tsibble:
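The conversion step can be sketched like this. (A hedged sketch: it is shown on a tiny stand-in tibble with the same columns, since the full file lives on disk; it assumes dplyr, lubridate and tsibble are loaded. Apply the same pipeline to the full births tibble.)

```r
library(dplyr)
library(lubridate)
library(tsibble)

# A tiny stand-in for the births tibble (same columns as above):
births_demo <- tibble(year = 2000, month = 1, date_of_month = 1:3,
                      day_of_week = c(6, 7, 1), births = c(9083, 8006, 11363))

births_tsibble <- births_demo %>%
  mutate(date = make_date(year, month, date_of_month)) %>% # combine Y/M/D into a Date
  select(date, births, date_of_month, day_of_week) %>%
  as_tsibble(index = date) # daily data, so the interval is [1D]
```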

# A tsibble: 5,479 x 4 [1D]
   date       births date_of_month day_of_week
   <date>      <dbl>         <dbl>       <dbl>
 1 2000-01-01   9083             1           6
 2 2000-01-02   8006             2           7
 3 2000-01-03  11363             3           1
 4 2000-01-04  13032             4           2
 5 2000-01-05  12558             5           3
 6 2000-01-06  12466             6           4
 7 2000-01-07  12516             7           5
 8 2000-01-08   8934             8           6
 9 2000-01-09   7949             9           7
10 2000-01-10  11668            10           1
# … with 5,469 more rows

Hmm…can we try to plot box plots over time (Candle-Stick Plots)? Over month / quarter or year?

# Monthly box plots
births_tsibble %>%
  index_by(month_index = ~ yearmonth(.)) %>% # 180 months over 15 years
  # No need to summarise, since we want boxplots per year / month
  ggplot(., aes(y = births, x = date, 
                group =  month_index)) + # plot the groups
  
  geom_boxplot(aes(fill = month_index))      # 180 plots!!  

# Quarterly boxplots
births_tsibble %>%
  index_by(qrtr_index = ~ yearquarter(.)) %>% # 60 quarters over 15 years
  # No need to summarise, since we want boxplots per year / month
  ggplot(., aes(y = births, x = date, 
                group = qrtr_index)) +
  
  geom_boxplot(aes(fill = qrtr_index))        # 60 plots!!

# Yearwise boxplots
births_tsibble %>% 
  index_by(year_index = ~ lubridate::year(.)) %>% # 15 years, 15 groups
    # No need to summarise, since we want boxplots per year / month

  ggplot(., aes(y = births, 
                x = date, 
                group = year_index)) + # plot the groups
  
  geom_boxplot(aes(fill = year_index)) +           # 15 plots
  scale_fill_distiller(palette = "Spectral")

Although the graphs are very busy, they do reveal seasonality trends at different periods.

Case Study -1: Walmart Sales Dataset from timetk

Let us inspect what datasets are available in the package timetk. Type data(package = "timetk") in your Console to see what datasets are available.

Let us choose the Walmart Sales dataset. See Walmart Recruiting - Store Sales Forecasting | Kaggle for more details.

data("walmart_sales_weekly")
walmart_sales_weekly
# A tibble: 1,001 × 17
   id    Store  Dept Date       Weekly_Sa…¹ IsHol…² Type    Size Tempe…³ Fuel_…⁴
   <fct> <dbl> <dbl> <date>           <dbl> <lgl>   <chr>  <dbl>   <dbl>   <dbl>
 1 1_1       1     1 2010-02-05      24924. FALSE   A     151315    42.3    2.57
 2 1_1       1     1 2010-02-12      46039. TRUE    A     151315    38.5    2.55
 3 1_1       1     1 2010-02-19      41596. FALSE   A     151315    39.9    2.51
 4 1_1       1     1 2010-02-26      19404. FALSE   A     151315    46.6    2.56
 5 1_1       1     1 2010-03-05      21828. FALSE   A     151315    46.5    2.62
 6 1_1       1     1 2010-03-12      21043. FALSE   A     151315    57.8    2.67
 7 1_1       1     1 2010-03-19      22137. FALSE   A     151315    54.6    2.72
 8 1_1       1     1 2010-03-26      26229. FALSE   A     151315    51.4    2.73
 9 1_1       1     1 2010-04-02      57258. FALSE   A     151315    62.3    2.72
10 1_1       1     1 2010-04-09      42961. FALSE   A     151315    65.9    2.77
# … with 991 more rows, 7 more variables: MarkDown1 <dbl>, MarkDown2 <dbl>,
#   MarkDown3 <dbl>, MarkDown4 <dbl>, MarkDown5 <dbl>, CPI <dbl>,
#   Unemployment <dbl>, and abbreviated variable names ¹​Weekly_Sales,
#   ²​IsHoliday, ³​Temperature, ⁴​Fuel_Price
inspect(walmart_sales_weekly)

categorical variables:  
       name     class levels    n missing
1        id    factor   3331 1001       0
2 IsHoliday   logical      2 1001       0
3      Type character      1 1001       0
                                   distribution
1 1_1 (14.3%), 1_3 (14.3%), 1_8 (14.3%) ...    
2 FALSE (93%), TRUE (7%)                       
3 A (100%)                                     

Date variables:  
  name class      first       last min_diff max_diff    n missing
1 Date  Date 2010-02-05 2012-10-26   0 days   7 days 1001       0

quantitative variables:  
           name   class         min          Q1      median          Q3
1         Store numeric      1.0000      1.0000      1.0000      1.0000
2          Dept numeric      1.0000      3.0000     13.0000     93.0000
3  Weekly_Sales numeric   6165.7300  28257.3000  39886.0600  77943.5700
4          Size numeric 151315.0000 151315.0000 151315.0000 151315.0000
5   Temperature numeric     35.4000     57.7900     69.6400     80.4900
6    Fuel_Price numeric      2.5140      2.7590      3.2900      3.5940
7     MarkDown1 numeric    410.3100   4039.3900   6154.1400  10121.9700
8     MarkDown2 numeric      0.5000     40.4800    144.8700   1569.0000
9     MarkDown3 numeric      0.2500      6.0000     25.9650    101.6400
10    MarkDown4 numeric      8.0000    577.1400   1822.5500   3750.5900
11    MarkDown5 numeric    554.9200   3127.8800   4325.1900   6222.2500
12          CPI numeric    210.3374    211.5312    215.4599    220.6369
13 Unemployment numeric      6.5730      7.3480      7.7870      7.8380
           max         mean           sd    n missing
1       1.0000 1.000000e+00 0.000000e+00 1001       0
2      95.0000 3.585714e+01 3.849159e+01 1001       0
3  148798.0500 5.464634e+04 3.627627e+04 1001       0
4  151315.0000 1.513150e+05 0.000000e+00 1001       0
5      91.6500 6.830678e+01 1.420767e+01 1001       0
6       3.9070 3.219699e+00 4.260286e-01 1001       0
7   34577.0600 8.090766e+03 6.550983e+03  357     644
8   46011.3800 2.941315e+03 7.873661e+03  294     707
9   55805.5100 1.225400e+03 7.811934e+03  350     651
10  32403.8700 3.746085e+03 5.948867e+03  357     644
11  20475.3200 5.018655e+03 3.254071e+03  357     644
12    223.4443 2.159969e+02 4.337818e+00 1001       0
13      8.1060 7.610420e+00 3.825958e-01 1001       0
glimpse(walmart_sales_weekly)
Rows: 1,001
Columns: 17
$ id           <fct> 1_1, 1_1, 1_1, 1_1, 1_1, 1_1, 1_1, 1_1, 1_1, 1_1, 1_1, 1_…
$ Store        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ Dept         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ Date         <date> 2010-02-05, 2010-02-12, 2010-02-19, 2010-02-26, 2010-03-…
$ Weekly_Sales <dbl> 24924.50, 46039.49, 41595.55, 19403.54, 21827.90, 21043.3…
$ IsHoliday    <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
$ Type         <chr> "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A…
$ Size         <dbl> 151315, 151315, 151315, 151315, 151315, 151315, 151315, 1…
$ Temperature  <dbl> 42.31, 38.51, 39.93, 46.63, 46.50, 57.79, 54.58, 51.45, 6…
$ Fuel_Price   <dbl> 2.572, 2.548, 2.514, 2.561, 2.625, 2.667, 2.720, 2.732, 2…
$ MarkDown1    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ MarkDown2    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ MarkDown3    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ MarkDown4    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ MarkDown5    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ CPI          <dbl> 211.0964, 211.2422, 211.2891, 211.3196, 211.3501, 211.380…
$ Unemployment <dbl> 8.106, 8.106, 8.106, 8.106, 8.106, 8.106, 8.106, 8.106, 7…
# Try this in your Console
# help("walmart_sales_weekly")

The data is described as:

A tibble: 1,001 x 17

  • id Factor. Unique series identifier (7 total)
  • Store Numeric. Store ID.
  • Dept Numeric. Department ID.
  • Date Date. Weekly timestamp.
  • Weekly_Sales Numeric. Sales for the given department in the given store.
  • IsHoliday Logical. Whether the week is a “special” holiday for the store.
  • Type Character. Type identifier of the store.
  • Size Numeric. Store square-footage
  • Temperature Numeric. Average temperature in the region.
  • Fuel_Price Numeric. Cost of fuel in the region.
  • MarkDown1, MarkDown2, MarkDown3, MarkDown4, MarkDown5 Numeric. Anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
  • CPI Numeric. The consumer price index.
  • Unemployment Numeric. The unemployment rate in the region.

Very cool to know that mosaic::inspect() identifies date variables separately!

NOTE:

  1. This is still a data.frame, with a time-oriented variable of course, and not yet a time-series object. Note that this data frame has the YMD columns repeated for each Dept.
  2. The Date column has repeated entries for each Dept! To deal with this repetition, we will always need to split the Weekly_Sales by the Dept column before we plot or analyze.

Since our sales are weekly, we will convert this tibble to a tsibble with Date as the index; tsibble detects the weekly (7-day) interval automatically:

walmart_time <- walmart_sales_weekly %>% 
  as_tsibble(index = Date, # the time variable
             key = Dept)   # identifies the unique "subject" being measured
# All other variables, such as Weekly_Sales, become "measured variables".
# Each observation must be uniquely identified by the index and key together.
walmart_time
# A tsibble: 1,001 x 17 [7D]
# Key:       Dept [7]
   id    Store  Dept Date       Weekly_Sa…¹ IsHol…² Type    Size Tempe…³ Fuel_…⁴
   <fct> <dbl> <dbl> <date>           <dbl> <lgl>   <chr>  <dbl>   <dbl>   <dbl>
 1 1_1       1     1 2010-02-05      24924. FALSE   A     151315    42.3    2.57
 2 1_1       1     1 2010-02-12      46039. TRUE    A     151315    38.5    2.55
 3 1_1       1     1 2010-02-19      41596. FALSE   A     151315    39.9    2.51
 4 1_1       1     1 2010-02-26      19404. FALSE   A     151315    46.6    2.56
 5 1_1       1     1 2010-03-05      21828. FALSE   A     151315    46.5    2.62
 6 1_1       1     1 2010-03-12      21043. FALSE   A     151315    57.8    2.67
 7 1_1       1     1 2010-03-19      22137. FALSE   A     151315    54.6    2.72
 8 1_1       1     1 2010-03-26      26229. FALSE   A     151315    51.4    2.73
 9 1_1       1     1 2010-04-02      57258. FALSE   A     151315    62.3    2.72
10 1_1       1     1 2010-04-09      42961. FALSE   A     151315    65.9    2.77
# … with 991 more rows, 7 more variables: MarkDown1 <dbl>, MarkDown2 <dbl>,
#   MarkDown3 <dbl>, MarkDown4 <dbl>, MarkDown5 <dbl>, CPI <dbl>,
#   Unemployment <dbl>, and abbreviated variable names ¹​Weekly_Sales,
#   ²​IsHoliday, ³​Temperature, ⁴​Fuel_Price

Basic Time Series Plots

The easiest way is to use autoplot from the feasts package. You may need to specify the actual measured variable, if there is more than one numerical column:

autoplot(walmart_time, .vars = Weekly_Sales)

timetk gives us interactive plots that may be more evocative than the static plot above. The basic plot function with timetk is plot_time_series. There are arguments for the date variable, the value you want to plot, colours, groupings etc.

Let us explore this dataset using timetk, using our trusted method of asking Questions:

Q.1 How are the weekly sales different for each Department?

There are 7 Departments, so we should be fine plotting them all together, and also faceting by them, as we will see in a bit:

walmart_time %>% 
  timetk::plot_time_series(Date, Weekly_Sales,
                           .color_var = Dept, .legend_show = TRUE,
                           .title = "Walmart Sales Data by Department",
                           .smooth = FALSE)

Q.2. What do the sales per Dept look like during the month of December (Christmas time) in 2011? Show the individual Depts as facets.

We can of course zoom into the interactive plot above, but if we were to plot it anyway:

# Only include rows from December 1 to December 31, 2011
# Data goes only up to Oct 2012

walmart_time %>% 
  # .start_date and .end_date accept character dates such as "YYYY-MM-DD"
  timetk::filter_by_time(.date_var = Date,
                         .start_date = "2011-12-01",
                         .end_date = "2011-12-31") %>%

  plot_time_series(.date_var = Date, 
                   .value = Weekly_Sales, 
                   .color_var = Dept, 
                   .facet_vars = Dept, 
                   .facet_ncol = 2,
                   .smooth = FALSE) # Only 4 points per graph

Clearly the “unfortunate” Dept#13 has seen something of a Christmas drop in sales, as has Dept#38! For the rest, all seems well…

Too much noise? How about some averaging?

Q.3 How do we smooth out some of the variations in the time series to be able to understand it better?

Sometimes there is too much noise in the time series observations, and we want to take what is called a rolling average. For this we will use the function timetk::slidify to create an averaging function of our choice, and then apply it to the time series using regular dplyr::mutate.

# Let's take a rolling average of Sales over 4 weeks (roughly a month) in each Department.
# Our **function** will be named "rolling_avg_month": 

rolling_avg_month = slidify(.period = 4,       # every 4 weeks
                            .f = mean,         # the function to apply
                            .align = "center", # aligned with the middle of the window
                            .partial = TRUE)   # to catch any leftover partial weeks
rolling_avg_month
function (...) 
{
    slider_2(..., .slider_fun = slider::pslide, .f = .f, .period = .period, 
        .align = .align, .partial = .partial, .unlist = .unlist)
}
<bytecode: 0x000001be0ef97078>
<environment: 0x000001be0ef97af8>

OK, slidify creates a function! Let’s apply it to the Walmart Sales time series…

walmart_time %>% 
  # group_by(Dept) %>% 
  mutate(avg_monthly_sales = rolling_avg_month(Weekly_Sales)) %>% 
  # ungroup() %>% 
  timetk::plot_time_series(Date, avg_monthly_sales, .color_var = Dept, .smooth = FALSE)

The curves are smoother now. But we need to check whether the averaging was done on a per-Dept basis: should we have had a group_by(Dept) before the averaging, and an ungroup() before plotting? Try it!!

Case Study-2: Dataset from nycflights13

Let us try the flights dataset from the package nycflights13. Try data(package = "nycflights13") in your Console.

We have the following datasets in nycflights13:

Data sets in package nycflights13:

  • airlines Airline names.
  • airports Airport metadata.
  • flights Flights data.
  • planes Plane metadata.
  • weather Hourly weather data.

Let us analyze the flights data:

data("flights", package = "nycflights13")
mosaic::inspect(flights)

categorical variables:  
     name     class levels      n missing
1 carrier character     16 336776       0
2 tailnum character   4043 334264    2512
3  origin character      3 336776       0
4    dest character    105 336776       0
                                   distribution
1 UA (17.4%), B6 (16.2%), EV (16.1%) ...       
2 N725MQ (0.2%), N722MQ (0.2%) ...             
3 EWR (35.9%), JFK (33%), LGA (31.1%)          
4 ORD (5.1%), ATL (5.1%), LAX (4.8%) ...       

quantitative variables:  
             name   class  min   Q1 median   Q3  max        mean          sd
1            year integer 2013 2013   2013 2013 2013 2013.000000    0.000000
2           month integer    1    4      7   10   12    6.548510    3.414457
3             day integer    1    8     16   23   31   15.710787    8.768607
4        dep_time integer    1  907   1401 1744 2400 1349.109947  488.281791
5  sched_dep_time integer  106  906   1359 1729 2359 1344.254840  467.335756
6       dep_delay numeric  -43   -5     -2   11 1301   12.639070   40.210061
7        arr_time integer    1 1104   1535 1940 2400 1502.054999  533.264132
8  sched_arr_time integer    1 1124   1556 1945 2359 1536.380220  497.457142
9       arr_delay numeric  -86  -17     -5   14 1272    6.895377   44.633292
10         flight integer    1  553   1496 3465 8500 1971.923620 1632.471938
11       air_time numeric   20   82    129  192  695  150.686460   93.688305
12       distance numeric   17  502    872 1389 4983 1039.912604  733.233033
13           hour numeric    1    9     13   17   23   13.180247    4.661316
14         minute numeric    0    8     29   44   59   26.230100   19.300846
        n missing
1  336776       0
2  336776       0
3  336776       0
4  328521    8255
5  336776       0
6  328521    8255
7  328063    8713
8  336776       0
9  327346    9430
10 336776       0
11 327346    9430
12 336776       0
13 336776       0
14 336776       0

time variables:  
       name   class               first                last min_diff   max_diff
1 time_hour POSIXct 2013-01-01 05:00:00 2013-12-31 23:00:00   0 secs 25200 secs
       n missing
1 336776       0

We have time-related columns: apart from year, month, and day, we have time_hour. We also have time-event numerical data such as arr_delay (arrival delay) and dep_delay (departure delay), and categorical data such as carrier, origin, dest, flight, and the tailnum of the aircraft. It is also a large dataset, containing over 336K entries. Enough to play with!!

Let us replace the NAs in arr_delay and dep_delay with zeroes for now, and convert it into a time-series object with tsibble:

flights_delay_ts <- flights %>% 
  
  mutate(arr_delay = replace_na(arr_delay, 0), 
         dep_delay = replace_na(dep_delay, 0)) %>% 
  
  select(time_hour, arr_delay, dep_delay, carrier, origin, dest, flight, tailnum) %>% 
  tsibble::as_tsibble(index = time_hour, 
                      # All the remaining identify unique entries
                      # Along with index
                      # Many of these variables are common
                      # Need *all* to make unique entries!
                      key = c(carrier, origin, dest, flight, tailnum), 
                      validate = TRUE) # Making sure each entry is unique


flights_delay_ts
# A tsibble: 336,776 x 8 [1h] <America/New_York>
# Key:       carrier, origin, dest, flight, tailnum [186,870]
   time_hour           arr_delay dep_delay carrier origin dest  flight tailnum
   <dttm>                  <dbl>     <dbl> <chr>   <chr>  <chr>  <int> <chr>  
 1 2013-05-04 07:00:00       -11        -6 9E      EWR    ATL     3633 N170PQ 
 2 2013-05-01 11:00:00       -24        -5 9E      EWR    ATL     4194 N170PQ 
 3 2013-05-03 09:00:00        15        -6 9E      EWR    ATL     4362 N170PQ 
 4 2013-05-02 09:00:00        -5        -6 9E      EWR    ATL     4362 N232PQ 
 5 2013-12-12 17:00:00        13        10 9E      EWR    CVG     3323 N8506C 
 6 2013-12-05 17:00:00         0         0 9E      EWR    CVG     3323 <NA>   
 7 2013-12-18 17:00:00       -13         1 9E      EWR    CVG     3335 N8412F 
 8 2013-12-11 17:00:00        -7        -8 9E      EWR    CVG     3335 N8505Q 
 9 2013-12-13 17:00:00        30        31 9E      EWR    CVG     3335 N8672A 
10 2013-12-16 17:00:00       -30       -14 9E      EWR    CVG     3335 N8877A 
# … with 336,766 more rows

Q.1. Plot the monthly average arrival delay by carrier

mean_arr_delays_by_carrier <- 
  flights_delay_ts %>%
  group_by(carrier) %>% 
  
  index_by(month = ~ yearmonth(.)) %>% 
  # index_by uses (year, yearquarter, yearmonth, yearweek, as.Date)
  # to create a new column to show the time-grouping
  # year / quarter / month/ week, or day...
  # which IS different from traditional dplyr
  
  summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE)
  )

mean_arr_delays_by_carrier
# A tsibble: 185 x 3 [1M]
# Key:       carrier [16]
   carrier    month mean_arr_delay
   <chr>      <mth>          <dbl>
 1 9E      2013 Jan           9.60
 2 9E      2013 Feb           7.61
 3 9E      2013 Mar           1.89
 4 9E      2013 Apr           5.07
 5 9E      2013 May           9.85
 6 9E      2013 Jun          19.7 
 7 9E      2013 Jul          21.3 
 8 9E      2013 Aug           5.01
 9 9E      2013 Sep          -6.81
10 9E      2013 Oct          -1.32
# … with 175 more rows
mean_arr_delays_by_carrier %>%
  timetk::plot_time_series(
    .date_var = month,
    .value = mean_arr_delay,
    .facet_vars = carrier,
    .smooth = FALSE,
    # .smooth_degree = 1,
    
    # keep .smooth off since it throws warnings if there are too few points
    # Like if we do quarterly or even yearly summaries
    # Use only for smaller values of .smooth_degree (0,1)
    #
    .facet_ncol = 4,
    .title = "Average Monthly Arrival Delays by Carrier"
  )

Q.2. Plot a candlestick chart for total flight delays by month for each origin airport

flights_delay_ts %>% 
  mutate(total_delay = arr_delay + dep_delay) %>%
  timetk::plot_time_series_boxplot(
    .date_var = time_hour,
    .value = total_delay,
    .color_var = origin,
    .facet_vars = origin,
    .period = "month",
  # same warning again
    .smooth = FALSE
  )

Q.3. Plot a heatmap chart for total flight delays by origin, aggregated by month

avg_delays_month <- flights_delay_ts %>% 
  group_by(origin) %>% 
  mutate(total_delay = arr_delay + dep_delay) %>% 
  index_by(month = ~ yearmonth(.)) %>% 
  # index_by uses (year, yearquarter, yearmonth, yearweek, as.Date)
  # to create a new column to show the time-grouping
  # year / quarter / month/ week, or day...
  # which IS different from traditional dplyr
    summarise(mean_monthly_delay = mean(total_delay, na.rm = TRUE)
  )

avg_delays_month 
# A tsibble: 36 x 3 [1M]
# Key:       origin [3]
   origin    month mean_monthly_delay
   <chr>     <mth>              <dbl>
 1 EWR    2013 Jan              27.0 
 2 EWR    2013 Feb              20.6 
 3 EWR    2013 Mar              27.7 
 4 EWR    2013 Apr              30.7 
 5 EWR    2013 May              20.2 
 6 EWR    2013 Jun              37.8 
 7 EWR    2013 Jul              36.4 
 8 EWR    2013 Aug              19.8 
 9 EWR    2013 Sep               2.54
10 EWR    2013 Oct              11.1 
# … with 26 more rows
# three origins x 12 months, therefore 36 rows
# tsibble's index_by + summarise also gives us a `month` column



ggformula::gf_tile(origin ~ month, fill = ~ mean_monthly_delay, 
                   color = "black", data = avg_delays_month,
                   title = "Mean Flight Delays from NY Airports in 2013") %>% 
  gf_theme(theme_classic()) %>% 
  gf_theme(scale_fill_viridis_c(option = "A")) %>% 
  
# "magma" (or "A") inferno" (or "B") "plasma" (or "C") 
# "viridis" (or "D") "cividis" (or "E") 
# "rocket" (or "F") "mako" (or "G") "turbo" (or "H")

  gf_theme(scale_x_time(breaks = 1:12, labels = month.abb))

Conclusion

We can plot most of the common Time Series plots with the help of the tidyverts packages (tsibble, feasts, fable, and fabletools), along with timetk and ggformula.

There are other plotting packages to investigate, such as dygraphs.

Recall that we have used the tsibble format for the data. There are other formats, such as ts and xts, which are also meant for time series analysis. But for our present purposes, the capabilities of timetk serve us well.

References

  1. Rob J Hyndman and George Athanasopoulos, Forecasting: Principles and Practice (3rd ed), Available Online https://otexts.com/fpp3/